[57] ABSTRACT

A method and system for fast indexing and searching of text
in compound-word languages such as Japanese, Chinese,
Hebrew, and Arabic. Computer codings of such compound-word
languages often contain different character types, e.g.
the shift-JIS coding of Japanese represents kanji, katakana,
hiragana, and roman characters with different codings in the
same character set. These character types are used to form
index terms and search terms. In a preferred embodiment, a
content-index search system is invoked in response to a query
on a collection of objects. The collection of objects is
indexed by the content-index and may, for example, be a corpus
of documents indexed by the terms contained in the documents.
A content-index search system uses the content-index to
generate and store an initial search result in response to the
query; a direct search system is used in certain situations.
The content-index contains, for each of a plurality of terms,
a reference to each object. The content-index is created by
first creating a preliminary index term for each of a plurality
of terms delimited by a word separator or a character type
transition in a string of characters to be indexed. For each
preliminary index term of a first type, e.g. katakana or
roman, the preliminary index term is utilized as an index
term. For each preliminary index term of a second type, e.g.
kanji, the preliminary index term is step-indexed to create a
plurality of index terms of a length less than a predetermined
step size. The index terms are then added to the content-index
in association with the object being indexed. A string of text
entered into a search engine as a search term is processed into
preliminary search terms and search terms in a similar manner.

METHOD AND SYSTEM FOR FAST
INDEXING AND SEARCHING OF TEXT IN 
COMPOUND-WORD LANGUAGES 

TECHNICAL FIELD 

The present invention relates generally to a computer
method and system for computer-based text indexing and
searching, and, more specifically, to a computer method and
system for conducting computer-based indexing and searching
of text encoded in compound-word languages, that is,
languages having words that are run together or lack
intervening word separators, particularly Japanese, Chinese, or
other Eastern languages.

BACKGROUND OF THE INVENTION 

Existing computer systems provide the capability to
search a collection of documents to identify those documents
that contain a certain word, or phrase, or a combination
of words. For example, given a collection of documents,
the computer system can return a list of the documents that
contain the word "patent" or can return a list of the
documents that contain the phrase "patent application." In
addition, the computer system can return a list of the
documents that either contain the word "patent" or contain
the word "application." This list includes those documents
that only contain the word "patent," those that only contain
the word "application," and those that contain both words.

Such computer systems also provide the ability to
efficiently find or retrieve documents in response to such
queries by indexing the contents of the documents. Indexing
information is typically stored in a structure referred to as a
content-index. A content-index typically indexes multiple
documents and includes indexing data (e.g., keywords) and
reference data that refers to the documents that contain the
indexing data. For example, a typical content-index may
store as the indexing data each major term contained in each
document. Each term is stored as a separate entry in the
content-index and each entry contains a reference to the
documents that contain that term. Thus, a content-index can
be used to determine which documents contain a particular
term.
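
The idea of a content-index can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `build_content_index`, the whitespace tokenization, and the sample documents are assumptions chosen only to show how each term maps to the set of documents that contain it, and how an OR query becomes a set union.

```python
from collections import defaultdict

def build_content_index(documents):
    """Map each term to the set of document names that contain it.

    `documents` is a dict of {name: text}. Splitting on whitespace is an
    assumption that only suits languages with word separators.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for term in text.lower().split():
            index[term].add(name)
    return index

docs = {
    "doc1": "patent application",
    "doc2": "patent",
    "doc3": "application form",
}
index = build_content_index(docs)

# Documents containing "patent" OR "application": union of two entries.
either = index["patent"] | index["application"]
```

A lookup such as `index["patent"]` touches only one entry of the index, which is why the documents themselves do not need to be scanned for simple term queries.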

A content-index is typically stored in an efficient data
structure, such as a hash table or binary tree (B-tree), so that
information can be retrieved efficiently in response to
queries. A typical content-index can be used to answer simple
queries involving the use of an indexed term verbatim, as a
prefix, or as specifying a range. For example, if the indexed
term is the word "second," then the content-index can be
used to find all documents that contain the word "second."
Also, the content-index can be used to find all documents
containing the word "second" as a prefix, by using a query
term such as "second*", where "*" is a wildcard character.
For example, a document containing the word "secondary"
would satisfy (match) the query. Also, for example, if the
indexed term is a range, for example, "second - fourth,"
then a document containing the word "third" would satisfy
the query. Such queries involve a simple lookup of the term
in the content-index and the retrieval of the set of documents
that contain the indexed term or a term within the specified
range.
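
Prefix and range lookups of this kind can be sketched over a sorted list of index terms. This is an illustrative sketch under stated assumptions, not the structure used in the patent: the helper names `terms_with_prefix` and `terms_in_range` are hypothetical, a sorted Python list stands in for the tree, and the range is compared lexicographically.

```python
import bisect

def terms_with_prefix(sorted_terms, prefix):
    """Return all terms that start with `prefix` (like the query "prefix*")."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

def terms_in_range(sorted_terms, low, high):
    """Return all terms t with low <= t <= high in lexicographic order."""
    lo = bisect.bisect_left(sorted_terms, low)
    hi = bisect.bisect_right(sorted_terms, high)
    return sorted_terms[lo:hi]

terms = sorted(["first", "fourth", "second", "secondary", "third"])
prefix_hits = terms_with_prefix(terms, "second")
range_hits = terms_in_range(terms, "second", "third")
```

Because the terms are kept in sorted order, both queries reduce to a binary search followed by a short sequential scan.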

Each different language lends certain complications to
creating content-indexes or searching. For example, written
Japanese consists of a mixture of several types of symbols,
each with its own function. The kanji characters are
pictographic-ideographic characters adopted from the Chinese
language, and are used for conceptual words and
indigenous names. The kana, which consists of hiragana and
katakana, are phonetic symbols developed in Japan. Each
symbol represents the sound of one syllable. Hiragana is
used to write the inflectional endings of conceptual words
written in kanji, as well as types of native words not written
in kanji, while katakana are used chiefly for words of foreign
origin. Besides these symbols, one often finds roman letters
and arabic numerals in Japanese text.

As with other languages, symbols representing characters
in the Japanese language can be represented by a series of
bits, in a manner similar to that in which an 8-bit byte is used
to represent ASCII characters. Given a particular coding
scheme, a table can be constructed that translates a given
code into an appropriate character of the language for
display on a display screen or printing by a printer.

Japanese, Chinese, and other languages have far more
symbols than English; the mere addition of numbers of
symbols affects indexing and searching efficiency. With
Japanese, as of 1981 the Japanese language consisted of
approximately 1,900 kanji characters, in addition to the 46
hiragana characters and 46 katakana characters. This large
number of characters cannot be encoded with a single 8-bit
byte, so more complex encoding schemes are required.

A popular encoding scheme for Japanese is called shift-JIS
(Japanese Industrial Standard). The shift-JIS Japanese
code is an 8-bit code which is primarily used for internal
processing of Japanese on various computer platforms.
Details about the relationship between the 8-bit codes and
the Japanese symbols may be found in the document entitled
Understanding Japanese Information Processing, by Ken
Lunde (O'Reilly & Associates, Inc., 1993), ISBN
1565920430, which is incorporated herein by reference and
made a part hereof.

As can be seen in the referenced publication, there are both
one-byte-per-character and two-byte-per-character modes in
the shift-JIS representation. The two-byte-per-character
mode is initiated when a byte with a decimal value ranging
between 129-159 or 224-239 is received. Either of these
bytes is subsequently treated as the first byte of an expected
two-byte sequence. The following or second byte must be a
byte with a decimal value ranging between 64-252 (but not
127, the delete (DEL) character). Note that the first byte's
range falls entirely in the extended ASCII character set,
which are true 8-bit characters.
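
The byte ranges just described are enough to split a shift-JIS byte string into one-byte and two-byte characters. The following Python sketch is illustrative only: the function names are hypothetical, and it applies just the ranges stated above, without validating that a code point is actually assigned.

```python
def is_lead_byte(b):
    """First byte of a two-byte sequence: decimal 129-159 or 224-239."""
    return 129 <= b <= 159 or 224 <= b <= 239

def is_trail_byte(b):
    """Second byte of a two-byte sequence: decimal 64-252, excluding 127 (DEL)."""
    return 64 <= b <= 252 and b != 127

def split_shift_jis(data):
    """Split shift-JIS encoded bytes into a list of per-character byte strings."""
    chars = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            if i + 1 >= len(data) or not is_trail_byte(data[i + 1]):
                raise ValueError(f"invalid two-byte sequence at offset {i}")
            chars.append(data[i:i + 2])
            i += 2
        else:
            chars.append(data[i:i + 1])  # one-byte-per-character mode
            i += 1
    return chars

# Two-byte kanji followed by a one-byte roman letter.
pieces = split_shift_jis(bytes([0x93, 0xFA, 0x41]))
```

Note that a trail byte may fall in the ASCII range (64-126), so a decoder must track lead bytes from the start of the buffer rather than inspecting a byte in isolation.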

Thus, the shift-JIS encoding scheme for Japanese, which
consists of one or two-byte sequences, can represent kanji,
katakana, hiragana, and roman characters ("character
types"). However, a search within or an index to a Japanese
text file encoded in shift-JIS is complicated by the fact that
there are no intervening word separators.

In a typical search, a user wants to find documents that
contain one or more words or word level concepts. In most
languages, words are consistently separated by word separator
characters such as a space, comma, period, etc., and
hence are easy to identify. However, in compound word
languages such as Japanese and Chinese, words are not
reliably separated by word separator characters. A string of
katakana or kanji characters in Japanese, for example,
typically contains two, three or more symbols with no word
separator characters in between. Native speakers use their
knowledge of word meaning and context to figure out where
the word boundaries are.

The large number of characters in languages such as
Japanese and Chinese, coupled with the difficulty of isolating
words for purposes of indexing or searching, creates
significant demands on computer resources in terms of disk
space, memory, and processing time. The present invention
seeks to provide an efficient method and system for indexing
and searching on documents encoded in such compound-word
languages.

SUMMARY OF THE INVENTION

The present invention provides an improved computer-based
method and system for efficient indexing and searching
of objects such as files, documents, or a collection of
documents represented in a code for compound-word
languages such as Japanese or Chinese, especially encodings
such as shift-JIS that use the same code to represent different
character types within a given character set. The method is
not dictionary-based and requires no special understanding
of the language being indexed or searched.

In the preferred environment for indexing, the system 
creates a content-index that indexes a collection of objects; 
this index is stored in a computer system. 

In response to a query, an initial search result is generated
using the content-index and stored. The stored search result
contains references to objects in the collection that match a
search criteria specified in the query. Once the search result
is initially generated using the content-index, the results (i.e.
a list of files or documents that satisfy the search criteria) are
displayed to the user. In other cases, a direct search (that is,
a search that compares an n-symbol search key to the entire
contents of each file or document, taking groupings of n
symbols) is conducted.
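
A direct search of this kind can be sketched as a sliding comparison of the key against every grouping of n symbols in the text. The function name `direct_search` is hypothetical, and a production system would likely use a faster string-matching routine; the sketch only mirrors the description above.

```python
def direct_search(text, key):
    """Return the start offsets at which `key` occurs in `text`.

    Every grouping of n = len(key) consecutive symbols is compared
    to the key, from the first symbol through the end of the text.
    """
    n = len(key)
    if n == 0:
        return []
    return [i for i in range(len(text) - n + 1) if text[i:i + n] == key]

offsets = direct_search("abcabc", "bc")
```

The cost grows with the total size of the documents searched, which is why the content-index is consulted first whenever it can answer the query.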

In accordance with the invention, a content index is
created by generating a reference to each object that contains
an index term by first creating a preliminary index term for
each of a plurality of terms. A term includes only characters
that can occur within a word. In addition, all characters of a
term are of the same type. Character types for Japanese are
Kanji, Katakana, Hiragana and Roman (characters that are
used in English and other Western languages). Numerals
count as word characters. Characters that cannot occur
within a word (such as the space, comma and other word
separators) are ignored. These characters are never
included in terms. After or during the process to create
preliminary terms, all preliminary terms go through a process
of normalization discussed in more detail later in this
document.
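
The word-breaking rule above (split at word separators and at character-type transitions) can be sketched in Python. Several things here are assumptions made for illustration and are not from the patent: the sketch classifies characters by Unicode code point ranges rather than by shift-JIS codes, it treats ASCII letters and digits as "roman", and the names `char_type` and `preliminary_terms` are hypothetical.

```python
def char_type(ch):
    """Classify a character; returns None for separators (assumed ranges)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if ch.isascii() and ch.isalnum():
        return "roman"  # assumption: numerals are grouped with roman letters
    return None

def preliminary_terms(text):
    """Split text into (type, term) pairs at separators and type transitions."""
    terms = []
    current = ""
    current_type = None
    for ch in text:
        t = char_type(ch)
        if t != current_type and current:
            terms.append((current_type, current))  # close the running term
            current = ""
        if t is not None:
            current += ch  # separators are never included in terms
        current_type = t
    if current:
        terms.append((current_type, current))
    return terms

terms = preliminary_terms("東京タワー, Tokyo")
```

No dictionary is consulted: the term boundaries come purely from the character classes, which matches the statement that the method requires no special understanding of the language.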

After normalization, some terms go directly into the index
with no further processing. These are terms of a "first type".
Using Japanese as an example, terms of a first type are
Roman and Katakana terms. Kanji terms are of a second
type that must go through an additional process of step
indexing. For each preliminary index term of a second type,
the system step-indexes the symbols in the preliminary
index term to create a plurality of index terms of a length less
than a predetermined step size.
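
Step indexing can be sketched in one line of Python. The function name `step_index` is hypothetical; the step size of four and the "abcdefg" example are the ones given in the detailed description, where the resulting tokens have a length less than or equal to the step size.

```python
def step_index(term, step_size=4):
    """Create one index token per starting symbol, each at most step_size long."""
    return [term[i:i + step_size] for i in range(len(term))]

tokens = step_index("abcdefg")
```

Every symbol becomes the start of a token, so a later search can match a word wherever it begins inside a run of kanji.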

In the current implementation for Japanese, Hiragana
terms are treated as stop words and are not included in the
index. Alternative embodiments of the present invention may
include a stop word list of Japanese words. For implementations
with a stop word list, all terms that are not in the stop
word list, including Hiragana terms, will be included in the
index.

The content index is then created by associating the object
with each of the index terms generated in the above manner.
After creating the content index, the content index can be
used to generate search results in the known manner.

A system constructed in accordance with the invention
comprises two primary parts. The first is a module that
accepts textual input from various filters that handle
documents and other files stored in a predetermined format. This
module creates preliminary index terms that are processed
further to create index terms.

Second, the invention further includes a token maker that
creates the index terms from the preliminary index terms.
String indexing and searching is used to handle words
encoded in katakana or hiragana or roman symbols, and step
indexing and searching is used for handling certain types of
strings such as kanji, by breaking collections of symbols into
substrings of a relatively small number. The token maker
produces tokens as an output, which map to terms (keys) of
an index. The tokens of a document map to corresponding
term/document associations in the index. The term/document
associations in the index are preferably organized
(sorted) by the terms, much in the same way that word and
subject references in an index of a book are alphabetically
ordered. The index is then used to quickly locate all document
associations to a given term.

Stated in other words, the invention comprises word
breaking: creating preliminary index terms based on
simple character level rules such as looking for character
type transitions within a character set or looking for word
separators (if any) that may be encountered in the text to be
indexed or searched. Then, a normalization process is
conducted to add to or replace characters in the file. Sometimes
two characters or character combinations may be used
interchangeably in a certain language. For example, variations
between upper and lower case in the English language
do not usually have much impact on meaning. Normalizing
to a single case makes the index smaller. The next step is to
remove noise or stop words, which are tokens that occur so
frequently in the particular language that indexing or searching
on such terms is usually unproductive. For example, in
the English language such stop words include "the", "of",
"and", etc. Finally, the steps of step indexing and searching,
and string indexing and searching are carried out.
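
The normalization and stop-word steps can be sketched for the English example given above. This is a minimal sketch under assumptions: case folding is the only normalization shown, the stop list holds just the three words named in the text, and the names `normalize` and `filter_terms` are hypothetical.

```python
STOP_WORDS = {"the", "of", "and"}  # the three examples named in the text

def normalize(term):
    """Fold a term to a single case so that variants share one index entry."""
    return term.lower()

def filter_terms(terms):
    """Normalize each term, then drop those that appear in the stop list."""
    normalized = (normalize(t) for t in terms)
    return [t for t in normalized if t not in STOP_WORDS]

kept = filter_terms(["The", "Patent", "of", "patent"])
```

After this step, "Patent" and "patent" collapse to one index term, and the stop words never reach the index.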

Accordingly, it is an object of the invention to provide an 
improved method and system for fast indexing and search- 
ing for documents encoded in compound-word languages. 

It is another object of the invention to provide an
improved method and system for fast indexing and searching
on documents encoded in shift-JIS Japanese representations.

It is another object of the invention to provide a
content-index creating and searching system that can be readily
incorporated into existing computer-based indexing and
searching systems.

These and other objects, features, and advantages of the
present invention may be more clearly understood and
appreciated from a review of the following detailed description
and by reference to the appended drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 is an overview block diagram of the process used 
to generate search results with a content-index. 

FIG. 2 is an example diagram of a File Open dialog for a 
Japanese language application program that incorporates the 
methods and systems of the present invention. 

FIG. 3 illustrates a typical implementation of a content-index
for a collection of documents, an example of a
mixed-symbol Japanese text string to be indexed or
searched, and a key buffer used to facilitate indexing and
searching.

FIG. 4 is a block diagram of a general purpose computer
for practicing preferred embodiments of the present invention.

FIG. 5 is a flow diagram of the computer program code or
routine for creating a content-index in accordance with the
present invention.

FIG. 6 is a flow diagram of a routine for indexing strings.

FIG. 7 is a flow diagram of a routine for step indexing.

FIG. 8 is a flow diagram of a routine for creating search
terms and displaying search results.

FIG. 9 is a flow diagram of a routine for conducting a
search.

FIG. 10 is a flow diagram of a routine for forming kanji
search terms.

FIG. 11 is a flow diagram of an alternative routine for
forming kanji search terms.

FIG. 12 is a flow diagram of a modification to the
content-index search code when used with search criteria
that goes beyond a search that can be resolved exclusively
using a content-index.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

The present invention provides methods and systems for
generating a search result that identifies objects that satisfy
a search criteria, especially for text in compound-word
languages such as Japanese or Chinese. According to the
invention, a user or a query program generates queries
regarding objects that are indexed by the content-index. In
response, a search system or engine responsible for executing
the query uses the content index in certain cases to
generate a search result, and in other instances conducts a
direct search on the collection of objects. The generation of
the search results is accomplished using well-known
mechanisms, such as searching the content-index for the
indexing terms specified by the search criteria and retrieving
references to objects that contain those terms as indicated
by the content-index.

In accordance with the invention, the content index is
created by generating a reference to each object that contains
an index term by first creating a preliminary index term for
each of a plurality of terms delimited by a word separator or
a character-type transition. For each preliminary index term
of a first type, the preliminary index term is used as an index
term. For each preliminary index term of a second type, the
system step-indexes the symbols in the preliminary index
term to create a plurality of index terms or tokens of a length
less than or equal to a predetermined step size. The content
index is then created by associating the object with each of
the index terms or tokens generated in the above manner.
After creating the content index, the content index can be
used to generate search results in the known manner.

In particular, in compound word languages such as Japanese
and Chinese, certain characteristics in a string of
characters in a text buffer (that is, an object to be indexed)
are examined to determine the location of known character
separators or character type transitions. This initial word
breaking creates preliminary index terms that are then
normalized to create a second set of preliminary index terms.
The preliminary index terms of a first type are used as index
terms or tokens "as is". Typically, character strings of the
first type include roman and katakana type characters, which
are utilized as index terms without further processing. In the
current embodiment for Japanese shift-JIS characters, character
strings of hiragana symbols are ignored.

Preliminary index terms of a second type are step-indexed
to create a plurality of index terms having a string length less
than or equal to a predetermined step size. In the case of
kanji characters, which in the preferred embodiment are the
preliminary index terms of the second type, the index is
created by taking the collection of symbols forming the kanji
character string, and creating a number of index terms each
of a length the same as the step size, beginning with the first
term in the string and extending to the end of the kanji
string, thereafter progressively reducing the step size
such that the last character in the kanji string is the last index
term. In this manner, all kanji terms are taken in "chunks" of
the step size or less, always beginning with one of the kanji
symbols and always ending with a symbol at the end of a
string of four or ending with the last symbol in the string.

The reason for step indexing is to cause the system to treat
every kanji symbol or character as the potential beginning of
a word. Furthermore, a step size is utilized that is equal to
or longer than most words in the language in question. For
Japanese, a step size of four is believed to be optimal. The
document is then indexed by all tokens produced by the step
indexing method. For example, the string "abcdefg" yields
the tokens "abcd", "bcde", "cdef", "defg", "efg", "fg", and
"g".

At search time, the same rules of token making are used
to create search terms from a search query provided by a
user, for the most part. For strings of the first type, typically
roman or katakana, the entire string is utilized as a search
term. A Roman search term must match an index term
exactly (unless the user has added pattern matching through
the use of "wild card" characters, for example to match any
character, or to match any number of characters). For
Katakana terms, any index term that includes the search term
is considered a match, and all object associations
(documents) are returned from the index. This is similar to
a search in which the wild card character is added to the
beginning and end of a search term.

For a kanji string (or any other string of the second type),
any index term that begins with the search term string is
considered a match, and all object associations are returned.
This correlates to a search in which the "*" wild card
character is added to the end of the search string. According
to one aspect of the invention, if the search term is greater
than a pre-defined step size in character length, step token
formation is carried out in a manner similar to that described
for step indexing. In the current implementation for Japanese
Kanji, the step size is 4. The kanji search string is
broken into search tokens of a step size of four or less. Each
of the search terms produced in this manner has the wild
card character "*" added to the end of the search token.
Then, all tokens are connected together with an AND
Boolean operator.

In accordance with another aspect of the invention, any
documents returned by the index search are then searched
sequentially with a direct search to verify that the full search
string is matched. This is essentially the same operation that
is performed for "phrase searching" in Western language
representations. The index search therefore returns potentially
matching documents, which are then searched directly
to verify any matches.

In the current embodiment for Japanese, it is believed that
a step length of four will produce acceptable performance
since it is estimated that 95% of kanji search words that a
user will enter will contain four or fewer characters. Making
a step length shorter saves on the size of the index since
index terms are shorter and fewer in number. Although other
step lengths can be utilized, for example, three or five, it is
believed that poorer performance will occur. When the step
length is three, it is believed that the likelihood increases by
roughly 20% that a search word will be longer than the step
length, forcing a slower direct search of any retrieved
documents to verify accuracy. Similarly, it is believed that a
step length of five or greater will bring diminishing returns
at the expense of a larger index, since the Japanese language
does not contain an extraordinarily high number of words
that contain five or more kanji symbols.

Token making for other compound-word languages such
as Chinese, Hebrew, Arabic, etc. may be constructed in a
manner similar to that described in connection with the
Japanese, except of course there is no counterpart to katakana
or hiragana symbols. Preferably, step indexing and
searching is applied to Chinese character strings in the same
manner as for Japanese.

FIG. 1 is an overview block diagram of the process used
to generate a search result in accordance with the invention.
A query 101 is generated by a program or by a user and sent
as input to search system 102. The search system 102
includes a content index 103, a content index search system
104, and a direct search system 105. For the purposes of this
invention, the content-index search system 104 is preferably
code that searches a content-index 103 based on a query and
generates a search result 110.

The direct search system 105 is preferably code that
directly searches on an object such as a document or a file
stored in the computer system's memory, based on the query,
and generates the search result 110, or a portion of the search
result. A direct search, as will be known to those skilled in
the art, is a search that involves comparison of a search
string or token of a given length to each possible string of the
given length in the file, starting with the first character in the
file and continuing through the file sequentially until each
grouping of characters of the given length has been compared
to the search string and the last character in the file is
encountered.

The search results 110 comprises a list of objects, such as
document file names and/or path names in the directory, that
identifies objects that satisfy the search criteria.

One skilled in the art will recognize that the direct search
system 105 may be an existing system and that the content
index search system 104 operates in conjunction therewith.
Alternatively, the content-index search system 104 and the
direct search system 105 may be part of the same system.
Furthermore, the content-index system 104 may operate
without the additional search result verification provided by
the direct search system. Other variations are also possible.

In particular, the present invention is operative in conjunction
and is compatible with the methods and systems
described in the patent application entitled "Method and
System for Generating Accurate Search Results Using a
Content Index", application Ser. No. 08/477,486, filed Jun.
7, 1995 (hereinafter, the "Content Index" patent), the disclosure
of which is incorporated herein by reference and
made a part hereof, and which is owned by the same
assignee as the present invention.

The present invention also provides the ability to generate
a search result when the collection of objects is only partially
indexed by the content-index. In such cases, a content-index
inclusion rule is provided for determining whether a given
object is indexed by the content-index. The portion of the
collection of objects indexed by the content-index is referred
to as the domain of the content-index. To accommodate
partial indexing, the methods and systems used to generate
an initial search result are modified to search the remaining
portion of the collection of objects not part of the domain in
addition to using the content-index, typically by direct
searching.

The methods and systems of the present invention can be
embodied in a File Open dialog, such as that provided by a
word processing application, to open for editing a particular
object (e.g. a document or set of documents) that matches a
specified search criteria. FIG. 2 is an example diagram of a
File Open dialog 201 that incorporates the methods and
systems of the present invention. The File Open dialog
window 201 contains search result list box 202, search string
entry or edit field 203, and various buttons, e.g., the
"Find Now" button 204, and an "Advanced Search" button
205 (all in Japanese characters, in this example). These
buttons are depressed in the conventional manner with a
pointing device, e.g. a mouse pointer and click operation,
or cursor keys and return (RET) keys.

When a user wants to find a file with contents that matches
a search string, the user can enter the string in the
search string edit field 203 and press the Find Now button
204 to instruct the word processing application to find all of
the documents with contents that match the search string
specified in edit field 203.

The "Advanced Search" button 205, when depressed,
generates an additional dialog (not illustrated), which allows
a user to specify a more complex search criteria, for
example a phrase search or a proximity search, or certain
operators (e.g. Boolean AND, OR, NOT). Specifically, if the
user wishes to specify a combination of text strings to search
for, then the user uses the Advanced Search dialog to enter
the text strings and the way in which the text strings should
be combined. For example, the user could specify a search
to find all documents containing the word "patent" or the
word "application" or both words (sometimes denoted as
"patent OR application").

Still referring to FIG. 2, the search string edit field 203
contains an exemplary search string
"KKKKKkkkksRrrrshhhhKKK", where K=kanji characters,
k=katakana characters, R=uppercase roman characters,
r=lowercase roman characters, h=hiragana characters, and
s=separator characters. For purposes of the discussion
examples, these symbols K, k, R, r, s, h, etc. will be used
instead of Japanese characters, it being understood that in
the preferred embodiment, such characters are displayed in
a Japanese character font, as is shown on the buttons in FIG.
2.

As shown in FIG. 2, the search result list box 202
currently contains the names of the files that contain the text
string specified in the search string edit field 203 after the
user has pressed the "Find Now" button 204. Specifically,
the search result list box 202 contains the names of three
files, "c:\mydoc\johndoe\doc1.txt,"
"c:\mydoc\johndoe\doc2.txt," and
"c:\mydoc\janedoe\doc5.xls," which contain the string (or a
portion thereof) shown in edit field 203. The search result
that is displayed in the search result list box 202 is generated
using the methods and systems of the present invention.
Typically, the result of the search is generated using a
content-index that indexes the files of the file system.

Although the present invention is discussed specifically
with reference to documents as objects, one skilled in the art
will appreciate that the present invention is useful in other
contexts as well, such as with any object that may be indexed
for searching purposes. For example, a graphical object
such as an electrical drawing, that contains symbolic
information, such as bitmaps of transistors and NAND gates,
can be indexed in a content-index using graphical bitmaps.
A content-index search system for these graphical objects
determines matches by searching the object contents for the
presence of the indexed graphical bitmap, for example, by
searching for a pattern of bits. In a similar manner, any
content-index that indexes a collection of objects is subject
to the methods and systems of the present invention as long
as a content-index search system is implemented.

FIG. 3 illustrates a typical implementation of a content-index
for a collection of documents. The documents shown
in FIG. 3 are those discussed with reference to FIG. 2. The
content-index 301 is shown after it has been generated to
index the "doc1.txt" document 321, the "doc2.txt" document
322, and the "doc5.xls" document 323. The content-index
301 comprises an inverted list 302 and an object list 303.
The inverted list 302 is arranged such that it efficiently stores
the indexing terms and the references to the documents that
contain each term. In a typical implementation such as that
shown in FIG. 3, the inverted list 302 comprises a directory
structure 304, which contains the indexing terms, and a leaf
structure 305. The leaf structure 305 contains leaves
309-312, which contain the references to the indexed
documents. The directory structure 304 stores the indexing terms
(or other indexing information) in a data structure that
allows efficient location of the desired term. For example,
directory structure 304 is shown implemented as a binary
tree (B-tree), which contains three nodes: node 306, node
307, and node 308. The letters "A," "B," "C," and "D"
represent the indexing terms, and correspond to exemplary
kanji string KKKKK 331, an exemplary katakana string kkk
332, an exemplary roman string Rrrr 333 (with initial
uppercase letter), and a hiragana string hhh 334, respectively.
The leaf structures 309-312 each contain references
to the documents that contain the indicated indexing term.

Consider, for example, document 321 and document 323,
which both contain the term "A." Node 307 in the directory
structure 304 contains an entry for the indexing term "A."
This entry points to leaf structure 309, which contains
references to two documents labeled "1" and "3." In the
particular implementation shown, the leaf structures
309-312 point to a centralized list of documents for the
entire content-index (object list 303) to avoid storing large
or redundant amounts of information in the leaf structures
themselves. Thus, the references to a document "1" and a
document "3" in the leaf structure 309 indicate which
documents in the object list 303 contain the indexing term.
Specifically, leaf structure 309 indicates that the document
referred to by the first entry 315 in object list 303 contains
indexing term "A" and that the document referred to by the
third entry 317 in object list 303 also contains the indexing
term "A." The object list 303 also contains additional
information regarding each object (document) that is
indexed by the inverted list 302. As shown, object list 303
contains in each entry the name of the object, the location of
the object (a directory path), and a timestamp indicating the
last time the object was modified. By examining the referred-to
entries in object list 303, the names and locations of the
documents containing the indexing term "A" can be
retrieved. Thus, the first entry 315 refers to the document
"doc1.txt" 321 and the third entry 317 refers to the document
"doc5.xls," both of which contain the term "A."

A content-index such as that discussed in conjunction
with FIG. 3 is used to generate the contents of the search
result list box 202 of the File Open dialog 201 in FIG. 2.
When the user presses the "Find Now" button 204, the code
that implements the File Open dialog invokes a search
system which uses one or more content indexes if these are
available. Each content index uses the directory structure
304 of the inverted list 302 to find the node(s) that
correspond(s) to the one or more search criteria (search
terms) specified in edit field 203. When a term is located in
the directory structure 304, the leaf structure associated with
the corresponding node is examined to retrieve the references
to the documents that contain that indexing term. With
respect to the example of FIG. 2, the content-index search
system uses the content-index to find all of the documents
that contain the term "C" (e.g. the roman string RRRR in a
shift-JIS representation) and the term "D" (i.e., a hiragana
string hhh in a shift-JIS representation). As seen in the
inverted list 302, the documents referred to by references
"1" and "4" match these search criteria, and the documents
"doc1.txt" 321 and "doc10.txt" are referred to by entries
1 and 4 in the object list 303. Once the document references
have been retrieved, the content-index search system generates
fully qualified names (pathnames) of the documents
by examining the proper entries from object list 303 and then
stores the pathnames as an initial search result. The initial
search result contains the names
"c:\mydoc\johndoe\doc1.txt" and
"c:\mydoc\janedoe\doc10.txt".
Note, however, that this initial search result is incorrect as
seen by examining the illustrated contents of documents
321, 322, and 323. Specifically, assume that the document
"doc10.txt" is no longer part of the collection. Also, assume
that the documents "doc2.txt" 322 and "doc5.xls" 323 have
been modified since the content-index 301 was last updated
and now match the search criteria because they both contain
the terms "C" and "D."
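
The index lookup that produces this initial search result can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dictionary stands in for the directory structure 304, sets stand in for the leaves, the postings for "C" and "D" are the minimum consistent with the text (references 1 and 4), and the timestamp values are invented placeholders. The names `ObjectEntry`, `lookup`, `search_all`, and `pathnames` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectEntry:
    name: str       # object (document) name
    location: str   # directory path
    modified: int   # timestamp at indexing time (placeholder value)


# Object list 303: entries are numbered from 1, as in the text.
OBJECT_LIST = {
    1: ObjectEntry("doc1.txt", r"c:\mydoc\johndoe", 100),
    2: ObjectEntry("doc2.txt", r"c:\mydoc\johndoe", 100),
    3: ObjectEntry("doc5.xls", r"c:\mydoc\janedoe", 100),
    4: ObjectEntry("doc10.txt", r"c:\mydoc\janedoe", 100),
}

# Inverted list 302: indexing term -> references into the object list.
# A dict replaces the tree-shaped directory structure for brevity.
INVERTED_LIST = {
    "A": {1, 3},
    "C": {1, 4},   # assumed postings, consistent with the example
    "D": {1, 4},   # assumed postings, consistent with the example
}


def lookup(term):
    """Return the object-list references stored in the leaf for one term."""
    return INVERTED_LIST.get(term, set())


def search_all(terms):
    """Return references to objects containing every term (logical AND)."""
    refs = None
    for term in terms:
        refs = lookup(term) if refs is None else refs & lookup(term)
    return sorted(refs or [])


def pathnames(refs):
    """Build fully qualified names from the referenced object-list entries."""
    return [OBJECT_LIST[r].location + "\\" + OBJECT_LIST[r].name for r in refs]
```

Calling `pathnames(search_all(["C", "D"]))` yields the two pathnames of the initial search result, including the stale reference to "doc10.txt".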

Search result correction routines may be invoked to
correct the initial search result. Specifically, such search
result correction routines can determine that the document
"c:\mydoc\janedoe\doc10.txt" is no longer part of the collection
and remove the reference to this document from the
initial search result. In addition, search result correction
routines can determine that documents 322 and 323 have
been modified since the time indicated by the timestamp
contained in the content index entries 316 and 317, which
correspond to these documents. Each of these modified
documents is then directly examined (a direct search) to
determine whether it matches the search criteria. After
determining that documents 322 and 323 now match the
search criteria, the search result correction routines add
references to the documents "c:\mydoc\johndoe\doc2.txt"
and "c:\mydoc\janedoe\doc5.xls" to the initial search result.
The code that implements the File Open dialog then displays
the corrected search result in search result list box 202.
One skilled in the art will recognize that the search result
displayed in list box 202 can be incrementally generated and
the incremental changes can be displayed as they are determined.
Alternatively, all of the corrections to the initial
search result can be determined before updating the displayed
list. Other similar variations are also possible in
conjunction with the methods and systems of the present
invention.
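
The correction pass described above might look roughly like the following Python sketch. It is not the correction routine itself: the collection is simulated with an in-memory dictionary instead of a file system, the direct search is a naive substring test, and the function names and timestamp values are hypothetical.

```python
def direct_search(text, terms):
    """Naive direct search: every term must occur somewhere in the text."""
    return all(term in text for term in terms)


def correct_result(initial, indexed, current, terms):
    """Correct an initial search result against the current collection.

    initial: pathnames returned by the content-index search
    indexed: pathname -> timestamp recorded in the content-index
    current: pathname -> (timestamp, text) for objects now in the collection
    """
    # Remove references to objects that are no longer part of the collection.
    corrected = [path for path in initial if path in current]
    for path, (mtime, text) in current.items():
        # Re-examine objects modified since indexing (or never indexed).
        if mtime > indexed.get(path, -1):
            matches = direct_search(text, terms)
            if matches and path not in corrected:
                corrected.append(path)
            elif not matches and path in corrected:
                corrected.remove(path)
    return corrected


DOC1 = r"c:\mydoc\johndoe\doc1.txt"
DOC2 = r"c:\mydoc\johndoe\doc2.txt"
DOC5 = r"c:\mydoc\janedoe\doc5.xls"
DOC10 = r"c:\mydoc\janedoe\doc10.txt"

indexed = {DOC1: 100, DOC2: 100, DOC5: 100, DOC10: 100}
current = {                      # doc10.txt has left the collection
    DOC1: (100, "A C D"),
    DOC2: (150, "B C D"),        # modified after indexing
    DOC5: (160, "A C D"),        # modified after indexing
}
result = correct_result([DOC1, DOC10], indexed, current, ["C", "D"])
```

With this sample data, `result` drops "doc10.txt" and gains "doc2.txt" and "doc5.xls", mirroring the example in the text.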

Still referring to FIG. 3, and as will be discussed in greater
detail below, the preferred embodiment of the invention
utilizes a key buffer 360 to temporarily store information
from a text buffer or file as the index terms or search terms
are formed from a string of mixed text. The key buffer 360
holds tuples that reference keys (substrings) within the main
text buffer. Each tuple preferably contains (1) a pointer to the
key within the main text buffer, (2) the key's length in bytes,
and (3) the character type, e.g., kanji, katakana, etc. In addition,
the tuple may contain additional type information (e.g. date,
number, text, Boolean) or other information relating to the
document being processed.

For example, the exemplary Japanese text string 350
KKKKKkkksRrrrshhh represents a mixture of word characters
of type kanji, katakana, roman and hiragana. The
string also includes a non-word character indicated by an "s"
in this example. This character could be a space, a comma
or any of a number of word separators used in Western
languages. This character might also be a Japanese middle-dot
character or any other Japanese character that is not
found within words. Characters in this non-word "other"
category are simply ignored.

This string in this example could either be text in the text
buffer to be indexed, or text entered by a user in the search
string entry box 203 in FIG. 2 for use in forming search
tokens. Note that each character or symbol in the string 350
has a predetermined position ranging from 0 to 16. Each of
these characters has a type and a position; a sequence of
characters of the same type has a string length. As such, a
substring can be represented in the key buffer 360 by a key
offset parameter (KO), a key length parameter (KL), and a
type, where K=kanji, k=katakana, s=separator, R=uppercase
roman, r=lowercase roman, and h=hiragana. For example,
the five character kanji string 331 begins at location 0 and
extends for five characters, as represented by the tuple (KO,
KL)=(0, 5) of type K. In like manner, each of the substrings
can be represented by the tuples indicated in the key buffer
360. As described below, data in a text buffer or in the search
string entry box is processed into the key buffer prior to
further processing to form preliminary index terms, index
terms, or search terms.
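
As a small illustration of the key buffer, the tuples for the exemplary string 350 can be written out in Python. Only the first tuple, (0, 5) of type K, is stated in the text; the remaining tuples are derived here from the character positions of the string. The sketch counts characters rather than bytes and uses a single type code "R" for the roman substring, both of which are simplifying assumptions, and the names `Key` and `key_text` are hypothetical.

```python
from typing import NamedTuple


class Key(NamedTuple):
    ko: int      # key offset (KO) into the text buffer
    kl: int      # key length (KL), counted in characters in this sketch
    ctype: str   # character type code: K, k, R, h, or s


TEXT = "KKKKKkkksRrrrshhh"   # exemplary string 350, positions 0 to 16

KEY_BUFFER = [
    Key(0, 5, "K"),    # kanji string 331 (given in the text)
    Key(5, 3, "k"),    # katakana string 332
    Key(8, 1, "s"),    # separator
    Key(9, 4, "R"),    # roman string 333
    Key(13, 1, "s"),   # separator
    Key(14, 3, "h"),   # hiragana string 334
]


def key_text(text, key):
    """Recover the substring that a key buffer tuple refers to."""
    return text[key.ko:key.ko + key.kl]
```

Because each tuple only points into the text buffer, no substring is copied until `key_text` is called.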

In preferred embodiments, the methods and systems of
the present invention are implemented on a computer system
comprising a central processing unit, a display, a memory,
and input/output devices. Preferred embodiments are
designed to operate in an operating system environment
such as the Microsoft WINDOWS environment defined by
Microsoft Corporation in Redmond, Washington. One
skilled in the art will recognize that embodiments of the
present invention can be practiced in other operating system
environments.

In this regard, FIG. 4 is a block diagram of a general
purpose computer 400 for practicing preferred embodiments
of the present invention. The computer system 400 contains
a central processing unit (CPU) 401, a display screen
(display) 403, one or more input/output devices 404, and a
computer memory, comprising a permanent storage device
405 such as a hard disk drive, and a random access memory
412. By use of the term "memory", we mean a permanent
storage device 405, random access memory (RAM) 412, or
both, in whole or in part. The memory stores various
computer programs, typically resident in "permanent" form
on the permanent storage device, but portions of which may
be resident in RAM for execution by the CPU 401.

In accordance with the invention, there is provided a
content index 410, a search results store 411, a content index
program 415, a search program 418, a key buffer 420, a
computer operating system 425, and perhaps other programs
430.

It will be understood that the search program 418, the
index program 415, and the code used to store the content-index
and the search results preferably reside in the memory
and execute on at least one CPU such as the CPU 401. These
programs are shown residing in RAM 412 as well as in the
permanent storage device 405, along with other programs
430. A content-index, such as that described with reference
to FIG. 3, is shown as content-index 410 also residing in the
memory. A search result, when generated in response to a
query, is shown also in the memory as search results store
411.



The memory is also shown containing the objects 406,
407, 408, and 409, which are indexed by the content-index
410. Alternatively, these objects and various parts of the
content index 410 or search results store 411 may reside on
an input/output device 404 such as permanent storage device
405.

Although the computer system 400 is shown as a single
computer, one skilled in the art will appreciate that the
present invention may be practiced on processing systems
with varying architectures, including networked
environments, multiprocessor environments, and on systems
with hardwired logic.

In one aspect of the invention, a preferred embodiment
provides an index system and an index search system
(program or code module) for carrying out the methods of
the present invention.

FIG. 5 is an overview flow diagram of the create content-index
search code 500 according to the present invention.
The create-index code takes a file of text in a compound-word
language such as Japanese or Chinese, filters the text,
and then "breaks" the resulting string into a plurality of
preliminary index terms. Preliminary index terms are each a
longest substring that contains only word characters and
only characters of a single type (kanji, katakana, hiragana
or roman). After creating the preliminary index terms,
which are stored in the key buffer, the preliminary index
terms are further processed to index the string and create the
content-index by associating the object with each of its
index terms.

Specifically, at step 501 an object such as a file or document
is opened or otherwise identified to the code module to
specify the object that is to be indexed. The contents of the
object are first filtered at step 505 to eliminate pictures,
graphics, formatting and other non-textual information. The
resulting text is fed, at step 503, into a text buffer, which is
a section of memory allocated for temporarily storing all or
at least a portion of the data to be indexed.

At step 510, steps are begun to create the preliminary
index terms. First, a given character C in the text buffer is
retrieved and examined. At step 513, if the character C is a
separator, the key offset (KO) and key length (KL) and
type=s (separator) are entered into the key buffer at step 515.

If at step 513 the character is not a separator, the "no"
branch is taken to step 519. If the character C represents a
character type transition, for example from kanji to
katakana, the "yes" branch is taken to step 521. At step 521,
the values of KO, KL and type are entered into the key buffer
as delimiting a preliminary index term.

As described elsewhere, character types for Japanese in
shift-JIS are kanji, hiragana, katakana, and roman (ASCII).
Substrings, that is, preliminary index terms, are formed at
character transitions because characters of different types
can never be in the same word, with a small number of
exceptions for characters shared in common by hiragana and
katakana. The output of the simple word breaking stage, set
forth in inquiry boxes 513 and 519, are strings, which are
candidates for preliminary index terms, in which all characters
are of the same type, e.g., all kanji, all roman, all
katakana, or all hiragana.

If at step 519 there is no character type transition yet, the
"no" branch is taken to step 523, where the inquiry is made
whether the end of file (EOF) has been encountered. If not,
the "no" branch is taken back to step 510 and the next
character is examined. The above-described steps repeat
until encountering the end of all the text in the text buffer.
When the EOF marker is seen at step 523, the "yes" branch
is taken to step 525.
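
The word breaking loop of steps 510 through 523 can be approximated by the Python sketch below. It departs from the text in one labeled respect: character types are decided from Unicode code point ranges rather than from shift-JIS byte codes, and the ranges chosen are an assumption that covers only common characters. The names `Key`, `char_type`, and `break_text` are hypothetical.

```python
from typing import NamedTuple


class Key(NamedTuple):
    ko: int      # key offset (KO)
    kl: int      # key length (KL), in characters
    ctype: str   # K=kanji, k=katakana, R=roman, h=hiragana, s=separator


def char_type(ch):
    """Classify one character by Unicode range (an assumed stand-in for shift-JIS)."""
    cp = ord(ch)
    if cp == 0x30FB:                       # Japanese middle dot: a separator
        return "s"
    if 0x3041 <= cp <= 0x309F:
        return "h"
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "k"
    if 0x4E00 <= cp <= 0x9FFF:
        return "K"
    if ch.isascii() and ch.isalnum():
        return "R"
    return "s"                             # everything else is a non-word character


def break_text(text):
    """Emit one (KO, KL, type) tuple per maximal run of same-type characters."""
    keys = []
    start = 0
    for i in range(1, len(text) + 1):
        # Close the current run at end of text or at a character type transition.
        if i == len(text) or char_type(text[i]) != char_type(text[start]):
            keys.append(Key(start, i - start, char_type(text[start])))
            start = i
    return keys


keys = break_text("日本語処理カメラ,Word からの")
```

The sample string has the same shape as the exemplary string 350 (five kanji, three katakana, a separator, four roman characters, a separator, three hiragana), so `keys` has the same offsets and lengths as the tuples for that string. Separator runs are kept as type "s" entries here; a caller may drop them.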
At step 525, the strings in the key buffer (that is, the
preliminary index terms) are normalized. The process of
normalization entails adding or replacing predetermined
characters of the character string forming the preliminary
index terms in the key buffer. It is known that sometimes two
characters or character combinations are used interchangeably
in certain languages; use of one combination or other
does not affect the meaning much, if at all. For example,
variations between upper and lower case in roman (English)
do not usually affect meaning for purposes of indexing. In
Japanese shift-JIS coding, certain katakana and ASCII characters
can be represented using either a single or double-byte
character code. Regardless of code, characters appear much
the same to a user and codes are often used interchangeably.
In accordance with the invention, all katakana and roman
strings are normalized to a single byte representation.
Normalization eliminates a certain "noise" factor by changing
any double-byte representations to a single-byte representation.
Those skilled in the art will understand which characters
in the shift-JIS representation may be normalized in
this manner.

After normalizing the strings in the key buffer at step 525,
the routine for indexing strings in the key buffer is carried
out at step 600.

One skilled in the art will recognize that the content index
program 500 for creating the preliminary index terms need
not necessarily be executed in the order shown in FIG. 5.
More specifically, in embodiments that support parallel
processing or threaded processes, the routines may be to
some extent executed in parallel or as separate threads. For
example, a file can be split into a number of different text
buffers and processed independently, since there is no
requirement that a file being indexed be handled in any
particular sequential manner. Also, certain of the routines
could be executed in reverse order. Different variations are
possible depending upon the optimizations desired, which
will occur to those skilled in the art.

FIG. 6 is a flow diagram of the index strings code 600,
which corresponds to the step 600 in FIG. 5. This routine
processes the data in the key buffer, that is, the preliminary
index terms, and creates index terms for certain character
types and carries out step indexing, as will be described, for
other character types, more specifically kanji.

Starting at step 601, the first step taken is to get a string
Sn, that is, a preliminary index term, from the key buffer. At
step 603, if the string Sn is of a roman type, the "yes" branch
is taken to step 605, and the entire string, whose position is
indicated by the value of the parameter KO in the key buffer
and of length KL, is utilized as a final index term. Then, the
index term so formed is added to the index in the conventional
manner. Those skilled in the art will understand that
the process of adding a term to a sorted index constructed in
a binary tree comprises traversing the tree to locate the
position in the B-tree for the index term and creating an
association between that index term and the object or file
being processed.

If at step 603 the string Sn is not roman, the "no" branch
is taken to step 609, where the string Sn is examined to see
if it is of the type katakana. If so, the "yes" branch is taken
to step 605. If at step 609 the preliminary index term Sn is
not katakana, the "no" branch is taken to step 611, where the
inquiry is made whether the character type is kanji. If so, the
"yes" branch is taken to step 700, where a step index routine,
described in connection with FIG. 7, is executed. After
returning from the step index routine at step 700, the code
branches to step 605 and the index term is added to the
index.

After step 611, if the character type is not kanji, the "no"
branch is taken to step 621, where the inquiry is made
whether the string Sn is of type hiragana. If so, the "yes"
branch is taken to step 623, and the hiragana string is deleted
from the key buffer. In the current embodiment, hiragana
strings are ignored (that is, they are not added to the index),
since hiragana is typically used only to write verb inflections
and preposition-type words such as "to" and "of". However,
some words (common words or sometimes in children's
books) are also written in hiragana. A preferred implementation
for Japanese would include a Japanese stop word list.
All hiragana terms not on this list would be indexed as type
1 terms, just as for katakana and roman.

From step 623, the code branches to step 625, where the
inquiry is made whether there are any more keys in the key
buffer to be processed. If so, the "yes" branch is taken back
to step 601 to process the next string Sn. If the key buffer has
been completely examined, the "no" branch is taken and the
index strings code routine exits.

The process described in FIG. 6 is also followed in order
to search one or more documents directly, either as a
follow-on to an index search or when documents must be
searched that are not included in an index. For direct
searching, however, there is no step indexing (step 700) nor
terms added to the index (step 605).

Turning next to FIG. 7, the steps taken in the code for step
indexing, shown at step 700 in FIG. 6, will next be
described. The step index code 700 is utilized to create
tokens of length MAX, which represents a maximum step size.
For Japanese represented in shift-JIS, in the preferred
embodiment the step size is preferably four or smaller. In
particular, the code shown in FIG. 7 takes a string such as
"abcdefg" and yields the substrings or tokens "abcd", "bcde",
"cdef", "defg", "efg", "fg", and "g". The idea for step
indexing is to treat every kanji character in a substring or
preliminary index term as the potential beginning of a word,
since one cannot be certain where words begin in a kanji
string. Furthermore, the step size should be equal to or
longer than most words encountered in the language of
interest. All tokens produced by step indexing form index
terms.

Starting at step 701, the key buffer entry for the kanji
string being processed is read from the key buffer to obtain
the tuple (KO, KL). At step 703, the value of a temporary
variable MAX is set to four. The variable MAX represents
the step size, which of course can be varied for other
languages. However, it is believed that for Japanese encoded
in shift-JIS the optimal step size is four.

At step 705, the key length parameter KL for the kanji
string is compared to MAX. Assume for purposes of this
discussion that the preliminary index term consists of five
kanji characters KKKKK, which we will represent as
"abcde". If KL is greater than MAX, the "yes" branch is
taken to step 709, where a temporary variable KLTemp is set
to the value of MAX. At step 711, the value of KO, KLTemp
is added to the key buffer. This adds the kanji string abcd to
the key buffer for use as an ultimate index term.

At step 713, KO is replaced by KO+1, and KL is replaced
by KL-1. In other words, a pointer moves to the next
sequential location in the kanji string and the length of the
string is decremented by one to indicate that a first token has
been formed. Then, at step 717, the value of KL is compared
to MAX. If KL is greater than MAX, the "yes" branch is
taken back to step 709 and the steps 709, 711, and 713 are
repeated. In this manner, the exemplary string "bcde" is
added to the key buffer. These steps would continue for
strings longer than the exemplary string, of course.
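
The type dispatch of FIG. 6 and the step indexing of FIG. 7 can be condensed into a short Python sketch. It is a simplification under stated assumptions: the two loops of the flow diagram (full-size windows, then shrinking tails) are folded into one loop that clamps the token length to the step size, lengths are counted in characters rather than bytes, hiragana and separator entries are simply skipped, and the names `step_index` and `index_terms` are hypothetical.

```python
MAX_STEP = 4   # step size MAX; four in the preferred embodiment


def step_index(ko, kl, max_step=MAX_STEP):
    """Return (offset, length) tokens, one starting at each position of a kanji run."""
    tokens = []
    while kl > 0:
        # Full-size window while enough characters remain, shorter tails afterwards.
        tokens.append((ko, min(kl, max_step)))
        ko, kl = ko + 1, kl - 1
    return tokens


def index_terms(text, key_buffer):
    """Turn preliminary index terms into final index terms by character type."""
    terms = []
    for ko, kl, ctype in key_buffer:
        if ctype in ("R", "k"):            # roman and katakana: whole string
            terms.append(text[ko:ko + kl])
        elif ctype == "K":                 # kanji: step indexing
            terms.extend(text[o:o + n] for o, n in step_index(ko, kl))
        # hiragana ("h") and separators ("s") are not indexed in this sketch
    return terms


tokens = ["abcdefg"[o:o + n] for o, n in step_index(0, 7)]
```

For the string "abcdefg", `tokens` contains "abcd", "bcde", "cdef", "defg", "efg", "fg", and "g", the same tokens listed in the description of FIG. 7. Starting a token at every position costs index space but means a lookup never depends on knowing where a kanji word begins.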
At step 717, when KL is no longer greater than MAX, the
"no" branch is taken to step 721. When this branch is taken,
the length of the remaining characters in the preliminary
index term is less than MAX, indicating that the final
character in the string now forms the last character in a
substring of size MAX in the previous operation. Then, at
step 721 the present value of KO, KL is added to the key
buffer. In the example being discussed, the substring "cde"
is added as an index term or token to the key buffer.

At step 723, KO is replaced by KO+1, and KL is replaced
by KL-1. At step 725, the value of KL is examined to see
if zero has been reached. If not, the "no" branch is taken to
step 721, and successively smaller tokens are created as
index terms, each ending in the final kanji character of the
preliminary index term. The steps are repeated until only the
final kanji character in the preliminary index term has been
added as a separate token to the key buffer. After KL reaches
zero, the step indexing routine is complete, the "yes" branch
is taken from step 725 and the routine exits.

The preceding routines have been directed toward procedures
for creating index terms that associate an object such
as a document or file with an index term. As those skilled in
the art will understand, in order to conduct an efficient
search, a user provides as a query one or more search terms
that are utilized as keys to access the content-index and
retrieve a list of references to all objects that contain the
particular index term. A similar procedure to that described
above in connection with the creation of index terms is
carried out in order to create search terms from a
user-specified search string (such as that of 203).

Turning next to FIG. 8 in this regard, a routine 800 for
creating search terms will be described. The routine in FIG.
8 is operative to receive a search string or query entered into
the search string entry box 203 shown in FIG. 2, derive
preliminary search terms, derive search terms from preliminary
search terms, and provide the search terms to a search
engine that retrieves references to objects containing the
search terms.

The first step taken at step 801 is to display the dialog such
as the File Open dialog 201 shown in FIG. 2. This creates the
edit field 203 into which the user can type a search string
utilizing an input device such as a keyboard associated with
a computer system that effects the present invention.

At step 803, a string of characters that is entered into the
search string entry box 203 is read into a text buffer. At step
805, the search string is examined for the existence of any
Boolean terms or other separators. Any
terms such as AND, OR, NOT, or other punctuation terms
are treated as Boolean operators. For example, a comma is
treated as a logical OR operator, while any separator other
than a comma is treated as a logical AND operator. These
Boolean terms are utilized in the query tree. Those skilled in
the art will understand that after the search terms are created,
they are utilized together with any Boolean operators to
construct a query tree that is used to select which objects
satisfy the search criteria, in the known manner. Since the
operation of a search engine that utilizes a query tree for
searching with indexes is known in the art, it will not be
discussed further herein.

In step 809, the text buffer containing the search string is
filtered to remove any "stop" or "noise" words. These words
include prepositions, definite and indefinite articles, etc., and
are typically language specific. After removing the stop
words, the search string is now ready to be processed to
identify separators and character type transitions, somewhat
similar to that described in connection with FIG. 5, thereby
deriving preliminary search terms.

At step 811, the first character C in the text buffer is
examined. If at step 813 the character C is a separator, the
"yes" branch is taken, and the value of (KO, KL) and type=s
is entered into a search term key buffer at step 815, and the
flow returns to step 811. If at step 813 the character is not
a separator, the "no" branch is taken to step 819 and the
inquiry is made whether there is a character type transition.
If not, the "no" branch is taken to step 821, and the inquiry
is made whether the end of the text has been
encountered, which is typically the presence of
[...]

At step 819, if a character type transition has been
encountered, the "yes" branch is taken to step 825 and the
tuple (KO, KL) and appropriate type indicator is entered into
the search term key buffer. The flow then returns to step 811.
[...] string of characters of the same type is
[...] parameters as
described in connection with FIG. 5.

Returning to step 821, if the end of text has been
encountered, the "yes" branch is taken to step 829, where all
search terms in the key buffer are normalized in the manner
described above. As described above, normalization includes
operations that change, add or delete characters, such as
various katakana "sounds same" transformations, two-byte
to single-byte normalization for katakana and ASCII
characters, case normalization to transform any upper case
characters to lower case, etc.

After normalizing at step 829, the search is conducted
utilizing the search terms at step 900, as described in
connection with FIG. 9. After the search is completed and
the list of objects that satisfy the search criteria is returned,
they are displayed as the search results list at step 837, and
the routine exits.

FIG. 9 is a flow diagram of the search routine that
corresponds to the routine 900 shown in FIG. 8. In this
routine, the preliminary search terms in the key buffer are,
in certain cases, utilized directly as search terms, and in
other cases processed further to form search terms.

Prior to describing the search code steps, it should be
understood that the search being described is primarily that
of utilizing a content-index as described herein. However, in
certain cases, a direct search is utilized in the preferred
embodiment. By "direct search", we mean a search of an
object such as a file or document by comparing a search term
or character string of a predetermined length with each
possible character string of that given string length, taken
sequentially in the object or document beginning at the
beginning of the file and extending to the end of the file, until
all the contents of the file have been compared to the key. As
will be understood by those skilled in the art, such direct
searches are slow compared to an index search, but certain
types of search operations cannot be conducted with an
index search.

Starting at step 901, the inquiry is first made whether an
index is available for utilization. If not, the "no" branch is
taken to step 903, and a direct search flag is set to indicate
that a direct search should be conducted. Typically, a direct
search will be conducted within a designated directory or
subdirectories.

If an index is available, the "yes" branch is taken from
step 901 to step 907. At step 907, the inquiry is made
whether the user has entered a command to conduct a proximity
search. A "proximity search" is a search to determine
whether a given first search term or key is within a predetermined
number of characters from a second search term or
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key. If a proximity search is indicated, the ^"ycs'* branch is step 1001. the inquiry is made as to whether the length of the 

taken to step 911 where the direct search flag is set, and the kanji string search term ST is greater than or equal to the 

program flow passes to step 913. maximum step size MAX, which in the preferred embodi- 

At step 913, the inquiry is made whether a phrase search mcnt is four. If not, die "no" branch is taken to step 1003 and 

has been indicated by the user. A "phrase search" is a search 5 a token T is created from the search string ST for utilization 

of a plurality of terms that appear in sequence (no interven- as the search term in the routine 900. At step 1007, an *****is 

ing terms). Any documents that are identified as containing added to the end of the token T, and the token T is ou^ut as 

the search terms by searching in the index are then further the kanji search term at step 1009 for utilization as the search 

searched directly to identify those documents containing the term. 

desired phrase. If a phrase search is indicated, the *Ves" r— ^ . - ^. . ^ ^ ^ . 

branch is taken to step 915, and the direct search flag is set '° Thus, for a kanji strmg. any index term that begm^ 

Then program flow passes to step 920. ^ considered a match and aU of its 

Starting at step 920. a loop is entered where the prelimi- <^^^^,^^ f .'?^*f ^^f ' ^f. returned TOs is similar to a 

nary search terms from the key buffer are examined to "^^^^ * wild card is added to the end of flie 

determine if they may be utilized directly as search terms, or seardi stnng. 

whether further processing is desired, as in the case of kanji. There is however, one excqption, which is shown taken at 

or whether direct searching should be indicated. At step 920 step 1001. If the length of die search term entered by die user 

a preliminary seardi term ST is retrieved from the key is greater than the step size MAX, the ""y^s" branch is taken 

buffer. At step 923. the type of the entry from the key buffer to st^ 1011. In diis case, the direct search flag is set and tiie 

is examined. If the type is hiragana, the '^yes** branch is taken routine exits. A direct search must be conducted in this case 

to step 925, and the direct search flag is set Hie program ^ because the kanji search string entered by the user is greater 

flow then branches to examine other strings in the key buffer than any of the kanji index terms, forcing a direct search, 

at step 927. If fliere arc more strings in the key buffer at step Referring now to FIG. 11, kanji search terms can be 

925. the *'yes" branch is taken back to step 920 and die next formed by conducting "*st€p searching" in a manner similar 

item in the key buffer is examined. to that described in connection with step indexing as an 

If at step 923 the type is not hiragana, the "no" branch is alternative to the routine shown in FIG. 10. The routine 

taken to step 930 and the inquiry made whedier the type is illustrated in FIG. 11 thcrefcH-e may be considered an alter- 

roman. If the type is roman at st^ 930, the **yes" branch is native kanji search term routine 1000. 

taken to step 933 and die string is used as a word level search Beginning at step 1101, die first step taken to set the 

term. Program flow dien passes to step 935, where the search ^ maximum step size m equal to MAX (four in the pffeferred 

term is utilized to access the ccHitent-index and retrieve a list embodiment). The maximum step size should correspond to 

of objects that contain this particular search term. After the maximum step size utilized to construct the index, 

retrieving the list of objects, dicy are stored in a temporary At step 1103, the lengdi of the search term ST is ccnnpared 

buffo- at step 940, and program flow passes to step 927. to m to see if it exceeds the step size. If so, the ^Ves" branch 

If at step 930 the type is not roman, the **no" branch is is taken to step 1107, and a token T is aeatcd utilizing the 

taken to step 943, and the inquiry is made whether the type first small m characters of the string ST. At step 1109. an * 

is katakana. If so. die **y€s**0 branch is taken to step 945 and is added to die end of the token T, and at step U 13. die token 

the katakana string is formed into a string level seardi term. t is provided as the kanji search term. At step 1117, some 

It should be understood at this juncture that for a katakana number, D, of characters from die search string ST is 

string, the entire index term list is searched. Any index term ^ removed, and die i^ograra flow branches back to step 1103. 

tfiat includes the katakana string is considered a match, and D can vary from 1 to the step size or higher. In the current 

all of its document associations are returned from the index. implementation D equals step size. For example, given a 

This is similar to a search in which the"*" wild card is added Kanji search string represented by "abcdef*. the current 

to die beginning and the end of the search string. Searching in^lcmcntation would "And" together search terms "abed" 

the entire index term list in this maimer takes longer than a and "er. 

typical search of the index in which only exact matches are if at step 1103 the lengdi of die search term is not greater 

returned. However, die seardi for a katakana string wittiin than m, the "no" branch is taken to step 1121. The variable 

the mdex is still bdieved to be faster by an estimated order ^ is replaced by m- 1, and a con^yarison is made at step 1125 

of ten or more than searching aU objects in tiie coUection whetfier itt=0. If not, tiicre are more characters remaining in 

^^^y- 5Q the search string ST andprogram flow branches back to step 

The string level search term, *ST*. is then utUized at step 1103. 

935 at described, and the search results are stored at step when fee variable m has reached 0 at step 1125 die 

seardi string has now been broken into a number of smaller 

If at step 943 die type is not katakana, die **no" branch is search terms of maximum size MAX, and die "yes" branch 
taken to step 950, and die inquiry is made as to whedier die 55 is taken to step 1130. At step 1130, die direct search flag is 

type is kanji. If not, die "no" branch is taken to step 927. If set to force fee direct searching of all objects feat satisfy fee 

so, die 'yts** branch is taken to step 1000, and kanji search seardi criteria to locate fee objects feat sadsfy fee entire 

terms are formed in accordance wife the procedure search term ST which is longer than fee step size, 

described in connection wife FIG. 10. After forming fee Theabovediscussionof embodiments of fee mefeods and 
kanji search terms by fee routine 1000. program control ^ systems of fee present invention has assumed that fee 

passes to steps 935 and 940 as previously described. coUecUon of objects is completely indexed by fee content- 

The result of fee steps described in connection wife FIG. index, or feat fee direct search flag will force a direct search 

9 is fee storage of search results, that is, a list of objects feat of certain objects under certain circumstances. One skilled in 

satisfy fee search criteria, stored in a temporary buffer, so fee art will realize feat ofecr embodiments are possible. For 
feat they can be dispUyed or otherwise utilized. 65 example, in one alternative embodiment fee coUection of 

FIG. 10 describes a preferred routine 1000 for fonning obje<is is only partially indexed by the content-index, 

kanji search terras frompreliminary search terms. Starting at According to diis embodiment there is a content-index 
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inclusion rule that indicates whether a particular object is 
indexed by the content-Index. The portiMi of the collection 
indexed is referred to as ^e domain of the content-index. 
The methods and systems described above are slightly 
modified to incorporate partial indexing. In particular, the 
generation of the initial search result using the content-index 
is slightly nKxiified. 

Specifically, in FIG. 8, the code that generates and dis-
plays the search result is modified to preferably first use the
content-index to efficiently generate an initial search result
and to then directly search the remaining objects in the
collection that are not in the domain of the content-index for
additional objects that match the search criteria. Then, the
code adds the references generated from the direct search to
the initial search result. Also, according to this embodiment,
it is preferable that a flag be included with each reference in 
the stored search result to indicate whether the reference was 
placed in the stored search result as a result of a direct search 
of the object as opposed to as a result of a search using the 
content-index. This flag is used for optimization purposes to 
avoid unnecessary searching of the object in the search 
result correction routines. One skilled in the art will recog- 
nize that the inclusion of such a flag is not necessary and that 
other implementations of preserving such information are 
possible. 
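
The partial-indexing variant above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the dictionary-based content-index, the `in_domain` inclusion rule, the `Reference` record and the plain substring test used for the direct search are all hypothetical and are not taken from the described embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    doc_id: str
    from_direct_search: bool  # True if found by scanning the object itself


def search_partial(term, content_index, in_domain, documents):
    """Search a partially indexed collection for one term.

    content_index: dict mapping a term to the set of indexed doc ids.
    in_domain:     hypothetical inclusion rule, doc_id -> bool.
    documents:     dict mapping every doc id to its text.
    """
    # Initial search result, generated from the content-index.
    result = [Reference(d, False) for d in sorted(content_index.get(term, ()))]
    # Direct search of the objects outside the domain of the content-index.
    for doc_id in sorted(documents):
        if not in_domain(doc_id) and term in documents[doc_id]:
            result.append(Reference(doc_id, True))
    return result
```

The `from_direct_search` field plays the role of the per-reference flag: a later correction pass could skip any object that has already been searched directly.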

Further details of partial index searching are described in 
the referenced "Content Index" patent.
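
For reference, the two kanji search-term routines described earlier in connection with FIGS. 10 and 11 can be sketched in Python as follows. This is a hedged illustration, not the patented implementation. The function names are hypothetical. The FIG. 10 sketch assumes the "greater than MAX" reading of the length test. The FIG. 11 sketch reproduces the worked "abcdef" example by cutting the string into chunks rather than tracing each flowchart step, and it returns the chunks without the trailing "*" that step 1109 appends.

```python
MAX_STEP = 4  # step size of the preferred embodiment


def kanji_term_fig10(st, max_step=MAX_STEP):
    """Return (search_terms, direct_search_flag) for one kanji string."""
    if len(st) > max_step:
        # Longer than any kanji index term: only a direct search can match.
        return [], True
    # Otherwise search the index for terms that begin with the string.
    return [st + "*"], False


def kanji_terms_fig11(st, max_step=MAX_STEP, d=None):
    """Break a kanji string into terms of at most max_step characters.

    d is the number of characters removed per pass; it defaults to
    max_step, as in the described current implementation.
    """
    d = max_step if d is None else d
    if d < 1:
        raise ValueError("d must be at least 1")
    terms = [st[i:i + max_step] for i in range(0, len(st), d)]
    # The terms are ANDed together; a string longer than the step size
    # still needs a direct search to confirm the whole string occurs.
    return terms, len(st) > max_step
```

With the defaults, `kanji_terms_fig11("abcdef")` yields the terms "abcd" and "ef" with the direct search flag set, matching the example in the description.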

In yet another embodiment, the methods and systems of
the present invention take into account that not all possible
searches can be solved using a content-index. The searches
that can be solved using a content-index depend upon the
information stored in the content-index. For example, a
search that involves searching for a particular occurrence of
a term in a document is typically not solved using a
content-index unless occurrence information is also stored in
the content-index. The "Advanced Search" button 205 in
FIG. 2, for example, could be used to specify such search
criteria.

For example, a content-index such as that described in
conjunction with FIG. 3 could store occurrence information
for each reference to a document that contains the indexing
term. More specifically, in one embodiment each reference
in each leaf structure 309-312 points to a tuple comprising
(reference to document, occurrence_1, . . . occurrence_n) where
the "reference to document" is the same as that shown in
FIG. 3 (e.g., "1") and each occurrence_i is an occurrence
indicator (e.g., a number), which indicates the location of the
indexing term within the document. An occurrence number
could indicate, for example, that the indexing term is the i-th
word in the document. For example, the tuples (1, 3, 16) and
(3, 5) substituted for leaf structure 309 indicate that the
indexing term "A" is found in document "1" as the 3rd word
and the 16th word and in document "3" as the 5th word.
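
As a minimal sketch of how such tuples might be consulted, the following Python fragment models a leaf structure as a list of tuples whose first element is the document reference and whose remaining elements are word positions. The list-of-tuples layout and the helper names are assumptions made for illustration only.

```python
def occurrences(leaf, doc_id):
    """Return the word positions of the indexing term in doc_id, or []."""
    for reference, *positions in leaf:
        if reference == doc_id:
            return positions
    return []


def occurs_at(leaf, doc_id, position):
    """True if the indexing term is the position-th word of doc_id."""
    return position in occurrences(leaf, doc_id)


# Tuples substituted for leaf structure 309 in the example above.
leaf_309 = [(1, 3, 16), (3, 5)]
```

Here `occurrences(leaf_309, 1)` gives `[3, 16]`, so a query for the indexing term as the 16th word of document 1 can be answered from the index alone.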

Alternatively, if the content-index does not store occur- 
rence information, then a search for a particular location or 
the i-th occurrence of a term in a document is not solvable
exclusively using that content-index. In this case, the meth- 
ods and systems of the present invention are modified to 
incorporate more complex searches. 

FIG. 12 is a flow diagram of the modifications to the
search code of FIG. 8 when used with a search criteria that
goes beyond a search solved exclusively using a content-
index. Typically, these modifications are employed to imple-
ment a direct search operation in response to the state of the
direct search flag set in steps shown in FIG. 9. Specifically,
in step 1201, the search criteria is divided into a content-
index portion, which is the portion of the search that can be
performed using a content-index, and a direct search portion,
which is the portion of the search that requires searching the
object directly to determine whether the content of the object
matches the search criteria. According to this embodiment,
the logical operator that joins the content-index portion of
search criteria and the direct search portion is preferably a
conjunction (e.g., a logical AND). That is, the portion of the
search that is not solved using the content-index (the direct
search portion) further restricts the results of the search
generated using the content-index portion. In step 1202, the
routine generates a proposed search result using the content-
index portion of the search criteria. In step 1203, the routine
directly searches each object referred to by the references in
the proposed search result for a match using the direct search
portion of the search criteria.
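
The two-stage flow of FIG. 12 can be sketched as follows. This is a rough Python illustration under stated assumptions: the content-index is modeled as a dictionary from term to a set of document ids, the direct search portion is passed in as a predicate over the document text, and treating an empty content-index portion as "every document is proposed" is a choice made here, not something the description specifies.

```python
def search(index_terms, direct_predicate, content_index, documents):
    """Combine a content-index lookup with a direct search (logical AND).

    index_terms:      terms forming the content-index portion (step 1201).
    direct_predicate: callable on document text, the direct search portion.
    content_index:    dict mapping a term to a set of document ids.
    documents:        dict mapping a document id to its text.
    """
    # Step 1202: proposed search result from the content-index portion.
    sets = [content_index.get(term, set()) for term in index_terms]
    proposed = set.intersection(*sets) if sets else set(documents)
    # Step 1203: directly search only the proposed objects.
    return sorted(d for d in proposed if direct_predicate(documents[d]))
```

For a phrase search, for instance, the index narrows the candidates to documents containing every word, and the predicate then checks that the words appear in sequence.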

Although the present invention has been disclosed and
described in terms of preferred embodiments, it is not
intended that the invention be limited to such embodiments.
Modifications within the spirit of the invention will be
apparent to those skilled in the art. The scope of the present
invention is defined by the claims which follow.
What is claimed is: 

1. A method in a computer system for generating a search
result that identifies objects that satisfy a search criteria, the
computer system having a collection of objects and a
plurality of terms, each object containing one or more of the
terms, the objects being represented in different types of
symbols in a compound word language such as Japanese or
Chinese, the method comprising the computer-implemented
steps of:

creating a content-index that contains, for each of the
plurality of terms, a reference to each object that
contains the term, by:

creating a preliminary index term of a first or second type
of symbol for each plurality of terms delimited by a
word separator or a character type transition;

for each preliminary index term of the first type, utilizing
the preliminary index term as an index term;

for each preliminary index term of the second type, step
indexing the symbols in the preliminary index term to
create a plurality of index terms of a length equal to or
less than a predetermined step size, the plurality of
index terms comprising a collection of substrings of
symbols selected from the preliminary index term that
begins with one of the symbols in the preliminary index
term and extends to a length of either the end of the
preliminary index term or to the number of symbols of
the predetermined step size, whichever is smaller;

creating the content-index by associating the object with
each of its index terms; and

after creating the content-index, using the content-index
to generate the search result.

2. The method of claim 1, further comprising the step of
normalizing any two-byte representations of symbols com-
prising an index term to a single-byte representation.

3. The method of claim 2, wherein the step of normalizing
is carried out on the preliminary index terms.

4. The method of claim 1, wherein the object to be
indexed is stored in a text buffer, and the preliminary index
terms are stored in a key buffer.

5. The method of claim 4, wherein the preliminary index
terms are represented by a plurality of key buffer entries,
wherein each key buffer entry comprises a tuple containing
a key length parameter and a key offset parameter relative to
the text buffer.

6. The method of claim 1, wherein the preliminary index
term of the first type is katakana in a shift-JIS representation.

7. The method of claim 1, wherein the preliminary index
term of the first type is roman in a shift-JIS representation.

8. The method of claim 1, wherein the preliminary index
term of the second type is kanji in a shift-JIS representation.

9. The method of claim 1, wherein the step of step
indexing comprises the steps of:

(a) creating an index term of a length equal to or less than
the predetermined step size, beginning with the first
symbol in the preliminary index term of the second
type and extending for the predetermined step size,

(b) eliminating the first symbol in the preliminary index
term to create a reduced preliminary index term,

(c) creating an index term of a length equal to predeter-
mined step size, beginning with the first symbol in the
reduced preliminary index term of the second type and
extending for the predetermined step size,

(d) repeating the steps (b) and (c) until the last symbol in
the reduced preliminary index term coincides with the
last symbol in preliminary index term;

(e) decrementing the predetermined step size;

(f) creating an index term of a length equal to the
decremented step size; and

(g) repeating the steps (e) and (f) until creating a last index
term comprising the last symbol in the preliminary
index term.

10. The method of claim 1 wherein the collection of
objects comprises a corpus of documents.

11. A method in a computer system for generating a search
result that identifies objects that satisfy a search criteria, the
computer system having a collection of objects and a
plurality of terms, each object containing one or more of the
terms, the objects being represented in different types of
symbols in a compound word language such as Japanese or
Chinese, and an index associating terms and objects, the
method comprising the computer-implemented steps of:

receiving a string of text as a preliminary search string;

creating a preliminary search term of a first or second type
of symbol for each plurality of terms in the preliminary
search string delimited by a word separator or a char-
acter type transition;

for each preliminary search term of the first type, utilizing
the preliminary search term as a search term;

for each preliminary search term of the second type, step
indexing the symbols in the preliminary search term to
create a plurality of search terms of a length equal to or
less than a predetermined step size, the plurality of
search terms comprising a collection of substrings of
symbols selected from the preliminary search term that
begins with one of the symbols in the preliminary
search term and extends to a length of either the end of
the preliminary search term or the number of symbols
of the predetermined step size, whichever is smaller;
and

using the search terms in the index to generate the search
result.

12. The method of claim 11, further comprising the step
of removing Boolean terms from the preliminary search
string prior to the step of creating the search terms.

13. The method of claim 11, further comprising the step
of normalizing any two-byte representations of symbols
comprising a search term to a single-byte representation.

14. The method of claim 11, wherein the preliminary
search term of the first type is katakana in a shift-JIS
representation.



15. The method of claim 11, wherein the preliminary 
search term of the first type is roman in a shift-JIS repre- 
sentation. 

16. The method of claim 11, wherein the preliminary 
search term of the second type is kanji in a shift-JIS 
representation. 

17. The method of claim 11, wherein the step of step
indexing comprises the steps of:

(a) creating a search term of a length equal to or less than
the predetermined step size, beginning with the first
symbol in the preliminary search term of the second
type and extending for the predetermined step size,

(b) eliminating the first symbol in the preliminary search
term to create a reduced preliminary search term,

(c) creating a search term of a length equal to predeter-
mined step size, beginning with the first symbol in the
reduced preliminary search term of the second type and
extending for the predetermined step size,

(d) repeating the steps (b) and (c) until the last symbol in
the reduced preliminary search term coincides with the
last symbol in preliminary search term;

(e) decrementing the predetermined step size;

(f) creating a search term of a length equal to the decre-
mented step size; and

(g) repeating the steps (e) and (f) until creating a last
search term comprising the last symbol in the prelimi-
nary search term.

18. A method in a computer system for providing a search
result that identifies objects in a compound word language
such as Japanese or Chinese that satisfy a search criteria, the
objects contained in a collection of objects, the search
criteria having a content-index search portion used with a
content-index to determine a set of objects of the collection
that satisfy the content-index search portion, the search
criteria having a direct search portion, the direct search
portion further restricting the set of objects that satisfy the
content-index search portion in order to satisfy the search
criteria, the method comprising the computer-implemented
steps of:

receiving a string of text in a compound word language
such as Japanese or Chinese as a preliminary search
string, the compound word language having symbols of
a first type such as kanji, katakana, and roman and
symbols of a second type such as hiragana;

creating a preliminary search term for each plurality of
terms in the preliminary search string delimited by a
word separator or a character type transition;

for each preliminary search term of the first type, utilizing
the preliminary search term as a search term in the
search criteria;

for each preliminary search term of the second type,
setting a direct search indicator;

in response to the direct search indicator, generating a
proposed list of references to objects that satisfy the
direct search portion of the search criteria by directly
searching the collection of objects with the preliminary
search term of the second type;

generating a proposed list of references to objects that
satisfy the content-index portion of the search criteria
by searching the content-index with the search term of
the first type; and

providing the search result by listing the collection of
objects that match the search criteria of the content-
index searching and of the direct searching.

19. The method of claim 18, further comprising the step
of step indexing the symbols in the preliminary search term
to create a plurality of search terms of a length equal to or
less than a predetermined step size, the plurality of search
terms comprising a collection of substrings of symbols
selected from the preliminary search term that begins with
one of the symbols in the preliminary search term and
extends to a length of either the end of the preliminary
search term or the number of symbols of the predetermined
step size, whichever is smaller.

20. The method of claim 18, further comprising the step
of examining each object in the proposed list of references
from the content-index portion to determine whether the
object also satisfies the direct search portion.

21. The method of claim 18 wherein the collection of 
objects is a plurality of documents. 

22. The method of claim 18, further comprising the step
of:

for each preliminary search term of the second type,
setting the direct search indicator if the preliminary
search term is longer than a predetermined length, and

creating a search term comprising the preliminary search
term appended with a wildcard character if the prelimi-
nary search term is not longer than the predetermined
length.

23. A computer system for generating a search result that
identifies objects that satisfy a search criteria, the computer
system storing a collection of objects and a plurality of
terms, each object containing one or more of the terms, the
objects being represented in different types of symbols in a
compound word language such as Japanese or Chinese,
comprising:

a content-index that contains, for each of the plurality of
terms, a reference to each object that contains the term;

a preliminary index term generator that generates, for
each plurality of terms delimited by a word separator or
a character type transition, a preliminary search term of
a first or second type of symbol;

an indexer that, for each preliminary index term of the
first type, utilizes the preliminary index term as an
index term;

the indexer, for each preliminary index term of the second
type, also step indexing the symbols in the preliminary
index term to create a plurality of index terms of a
length equal to or less than a predetermined step size,
the plurality of index terms comprising a collection of
substrings of symbols selected from the preliminary
index term that begins with one of the symbols in the
preliminary index term and extends to a length of either
the end of the preliminary index term or the number of
symbols of the predetermined step size, whichever is
smaller;

an object/index term associator that creates the content-
index by associating the object with each of its index
terms; and

a search engine that, after creating the content-index, uses
the content-index to generate the search result.

24. The system of claim 23, further comprising a normal-
izer that normalizes any two-byte representations of symbols
comprising an index term to a single-byte representation.

25. The system of claim 24, wherein the normalizing is
carried out on the preliminary index terms.

26. The system of claim 23, further comprising a text
buffer, and wherein the object to be indexed is stored in the
text buffer, and the preliminary index terms are stored in a
key buffer.

27. The system of claim 26, wherein the preliminary index
terms are represented by a plurality of key buffer entries,
wherein each key buffer entry comprises a tuple containing
a key length parameter and a key offset parameter relative to
the text buffer.

28. The system of claim 23, wherein the preliminary index
term of the first type is katakana in a shift-JIS representation.

29. The system of claim 23, wherein the preliminary index
term of the first type is roman in a shift-JIS representation.

30. The system of claim 23, wherein the preliminary index
term of the second type is kanji in a shift-JIS representation.

31. The system of claim 23, wherein the indexer is
operative for:

(a) creating an index term of a length equal to or less than
the predetermined step size, beginning with the first
symbol in the preliminary index term of the second
type and extending for the predetermined step size,

(b) eliminating the first symbol in the preliminary index
term to create a reduced preliminary index term,

(c) creating an index term of a length equal to predeter-
mined step size, beginning with the first symbol in the
reduced preliminary index term of the second type and
extending for the predetermined step size,

(d) repeating the steps (b) and (c) until the last symbol in
the reduced preliminary index term coincides with the
last symbol in preliminary index term;

(e) decrementing the predetermined step size;

(f) creating an index term of a length equal to the
decremented step size; and

(g) repeating the steps (e) and (f) until creating a last index
term comprising the last symbol in the preliminary
index term.

32. The system of claim 23 wherein the collection of
objects comprises a corpus of documents.

33. A computer system for generating a search result that
identifies objects that satisfy a search criteria, the computer
system storing a collection of objects in a compound word
language such as Japanese or Chinese and a plurality of
terms, each object containing one or more of the terms, and
an index associating terms and objects, comprising:

an input device for providing a string of text as a pre-
liminary search string;

a preliminary search term generator that generates a kanji
preliminary search term for each plurality of kanji
terms in the preliminary search string delimited by a
word separator or a character type transition;

a search term generator that provides, for each kanji
preliminary search term, a plurality of search terms of
a length equal to or less than a predetermined step size
by step indexing the symbols in the preliminary kanji
search term,

the plurality of search terms comprising a collection of
substrings of symbols selected from the preliminary
kanji search term that begins with one of the symbols
in the preliminary kanji search term and extends to a
length of either the end of the preliminary kanji search
term or the number of symbols of the predetermined
step size, whichever is smaller; and

a search engine that uses the search terms in the index to
generate the search result.

34. The system of claim 33, further comprising a filter for
removing Boolean terms from the preliminary search string
prior to the step of creating the search terms.

35. The system of claim 33, further comprising a normal-
izer for normalizing any two-byte representations of sym-
bols comprising a search term to a single-byte representa-
tion.

36. The system of claim 33, wherein the indexer is
operative for step indexing by:

(a) creating a search term of a length equal to or less than
the predetermined step size, beginning with the first
symbol in the kanji preliminary search term of the
second type and extending for the predetermined step
size,

(b) eliminating the first symbol in the kanji preliminary
search term to create a reduced kanji preliminary search
term,

(c) creating a search term of a length equal to predeter-
mined step size, beginning with the first symbol in the
reduced kanji preliminary search term of the second
type and extending for the predetermined step size,

(d) repeating the steps (b) and (c) until the last symbol in
the reduced kanji preliminary search term coincides
with the last symbol in kanji preliminary search term;

(e) decrementing the predetermined step size;

(f) creating a search term of a length equal to the decre-
mented step size; and

(g) repeating the steps (e) and (f) until creating a last
search term comprising the last symbol in the kanji
preliminary search term.

37. A computer system for providing a search result that
identifies objects in a compound word language such as
Japanese or Chinese that satisfy a search criteria, the objects
contained in a collection of objects, the search criteria
having a content-index search portion used with a content-
index to determine a set of objects of the collection that
satisfy the content-index search portion, the search criteria
having a direct search portion, the direct search portion
further restricting the set of objects that satisfy the content-
index search portion in order to satisfy the search criteria,
comprising:

an input device for providing a string of text in a com-
pound word language such as Japanese or Chinese as a
preliminary search string, the compound word lan-
guage having symbols of a first type such as kanji,
katakana, and roman and symbols of a second type such
as hiragana;

a preliminary search term generator for creating a pre-
liminary search term for each plurality of terms in the
preliminary search string delimited by a word separator
or a character type transition;

a search term generator that, for each preliminary search
term of the first type, utilizes the preliminary search
term as a search term in the search criteria;

the search term generator, for each preliminary search
term of the second type, setting a direct search indica-
tor;

a search engine that, in response to the direct search
indicator, generates a proposed list of references to
objects that satisfy the direct search portion of the
search criteria by directly searching the collection of
objects with the preliminary search term of the second
type;

the search engine also generating a proposed list of
references to objects that satisfy the content-index
portion of the search criteria by searching the content-
index with the search term of the first type; and

an output device that provides the search results by listing
the collection of objects that match the search criteria
of the content-index searching and of the direct search-
ing.

38. The system of claim 37, further comprising a step
indexer for providing the symbols in the preliminary search
term as a plurality of search terms of a length less than a
predetermined step size, the plurality of search terms com-
prising a collection of substrings of symbols selected from
the preliminary search term that begins with one of the
symbols in the preliminary search term and extends to a
length of either the end of the preliminary search term or the
number of symbols of the predetermined step size, which-
ever is smaller.

39. The system of claim 37, further comprising a com-
ponent in the search engine for examining each object in the
proposed list of references from the content-index portion to
determine whether the object also satisfies the direct search
portion.

40. The system of claim 37, wherein the collection of
objects is a plurality of documents.

41. The system of claim 37, wherein the search term
generator is operative for:

for each preliminary search term of the second type,
setting the direct search indicator if the preliminary
search term is longer than a predetermined length, and

creating a search term comprising the preliminary search
term appended with a wildcard character if the prelimi-
nary search term is not longer than the predetermined
length.